About the project_exercise1

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you can use R Markdown syntax.

The course Introduction to Open Data Science

This course intrigues me and I expect it to be very useful for my following research. The content of the course is very interesting that is why I have registered for the course. The course has very friendly and open environment.

Exercise 2

Describe the work you have done this week and summarize your learning.

1.This is the data from the international survey of Approaches to Learning. It icludes the attitute toward statistics, exam points, age, gender and anwers to the guestions which were combined in the theree variables as deep,surface and strategic questions. Also the observations where the exam points variable is zero were excluded.The data included 166 observations and 7 variables.

  1. The plot illustrates the normality of the errors disributions.The normal distributions implies that the majority of the obserbations are located close to the regression line. Here only a part of the observations are close what is why normality is questionable. The plot illustrates the relationships between the variables.For example, we can see a negative correlation between points and age, surf and stra, points and surf and etc. There are strong correlations between points and age, age and attitude, age and deep, attitude and deep, age and stra, attitude and stra, deep and stra and etc.

  2. I have created a model with the dependent variable points and independent variables attitude, age and stra. The attitude variable has a high significance (p-value).It has 3.480 estimated coefficient with the positive relations with the dependent variable (when the value of the independent variable rises dependent also rises). Age and stra veriables are significant on 0.05 level. Age has a negative etsimate that means that when it rises the points fall. Srta has a positive relation with the points.

  3. The R-square is not very big since the model predicts around 20% of the points.F statistics compares the model with the model with only intercept. Here it has significant p-value which means that my model predicts better than a model with no predictors.

5.Residuals vs Fitted values plot checks the assumption that size on errors should not depend on the explanotary variables.Here only a small part of the obseravations do not fit the pattern that is why there is no problem with the assumption. Normal QQ-plot shows that the erros of the model are normally distributed since points mostly fall on regression line. Residuals vs Leverage diahnostic measures the impact of the single observation to the model. In the plot we can see several observations with the higher leverage which are located too below the line (-4-2) when the line is around 0.

chooseCRANmirror(graphics=FALSE, ind=1)
install.packages("lmtest", repos = "http://cran.us.r-project.org")
## 
## The downloaded binary packages are in
<<<<<<< HEAD
##  /var/folders/8t/x1h4shkx7ms9f1___g0qnkhr0000gn/T//Rtmpa7Z52d/downloaded_packages
======= ## /var/folders/8t/x1h4shkx7ms9f1___g0qnkhr0000gn/T//RtmpewQAQu/downloaded_packages >>>>>>> 77475a1150362e7e1077341fd0cd3c34d7f8d6cc
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
learning2014 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", sep=",", header=TRUE)
learning2014
##     gender age attitude     deep  stra     surf points
## 1        F  53      3.7 3.583333 3.375 2.583333     25
## 2        M  55      3.1 2.916667 2.750 3.166667     12
## 3        F  49      2.5 3.500000 3.625 2.250000     24
## 4        M  53      3.5 3.500000 3.125 2.250000     10
## 5        M  49      3.7 3.666667 3.625 2.833333     22
## 6        F  38      3.8 4.750000 3.625 2.416667     21
## 7        M  50      3.5 3.833333 2.250 1.916667     21
## 8        F  37      2.9 3.250000 4.000 2.833333     31
## 9        M  37      3.8 4.333333 4.250 2.166667     24
## 10       F  42      2.1 4.000000 3.500 3.000000     26
## 11       M  37      3.9 3.583333 3.625 2.666667     31
## 12       F  34      3.8 3.833333 4.750 2.416667     31
## 13       F  34      2.4 4.250000 3.625 2.250000     23
## 14       F  34      3.0 3.333333 3.500 2.750000     25
## 15       M  35      2.6 4.166667 1.750 2.333333     21
## 16       F  33      4.1 3.666667 3.875 2.333333     31
## 17       F  32      2.6 4.083333 1.375 2.916667     20
## 18       F  44      2.6 3.500000 3.250 2.500000     22
## 19       M  29      1.7 4.083333 3.000 3.750000      9
## 20       F  30      2.7 4.000000 3.750 2.750000     24
## 21       M  27      3.9 3.916667 2.625 2.333333     28
## 22       M  29      3.4 4.000000 2.375 2.416667     30
## 23       F  31      2.7 4.000000 3.625 3.000000     24
## 24       F  37      2.3 3.666667 2.750 2.416667      9
## 25       F  26      3.7 3.666667 1.750 2.833333     26
## 26       F  26      4.4 4.416667 3.250 3.166667     32
## 27       M  30      4.1 3.916667 4.000 3.000000     32
## 28       F  33      3.7 3.750000 3.625 2.000000     33
## 29       F  33      2.5 3.250000 2.875 3.500000     29
## 30       M  28      3.0 3.583333 3.000 3.750000     30
## 31       M  26      3.4 4.916667 1.625 2.500000     19
## 32       F  27      3.2 3.583333 3.250 2.083333     23
## 33       F  25      2.0 2.916667 3.500 2.416667     19
## 34       F  31      2.4 3.666667 3.000 2.583333     12
## 35       M  20      4.2 4.500000 3.250 1.583333     10
## 36       F  39      1.6 4.083333 1.875 2.833333     11
## 37       M  38      3.1 3.833333 4.375 1.833333     20
## 38       M  24      3.8 3.250000 3.625 2.416667     26
## 39       M  26      3.8 2.333333 2.500 3.250000     31
## 40       M  25      3.3 3.333333 1.250 3.416667     20
## 41       F  30      1.7 4.083333 4.000 3.416667     23
## 42       F  25      2.5 2.916667 3.000 3.166667     12
## 43       M  30      3.2 3.333333 2.500 3.500000     24
## 44       F  48      3.5 3.833333 4.875 2.666667     17
## 45       F  24      3.2 3.666667 5.000 2.416667     29
## 46       F  40      4.2 4.666667 4.375 3.583333     23
## 47       M  25      3.1 3.750000 3.250 2.083333     28
## 48       F  23      3.9 3.416667 4.000 3.750000     31
## 49       F  25      1.9 4.166667 3.125 2.916667     23
## 50       F  23      2.1 2.916667 2.500 2.916667     25
## 51       M  27      2.5 4.166667 3.125 2.416667     18
## 52       M  25      3.2 3.583333 3.250 3.000000     19
## 53       M  23      3.2 2.833333 2.125 3.416667     22
## 54       F  23      2.6 4.000000 2.750 2.916667     25
## 55       F  23      2.3 2.916667 2.375 3.250000     21
## 56       F  45      3.8 3.000000 3.125 3.250000      9
## 57       F  22      2.8 4.083333 4.000 2.333333     28
## 58       F  23      3.3 2.916667 4.000 3.250000     25
## 59       M  21      4.8 3.500000 2.250 2.500000     29
## 60       M  21      4.0 4.333333 3.250 1.750000     33
## 61       F  21      4.0 4.250000 3.625 2.250000     33
## 62       F  21      4.7 3.416667 3.625 2.083333     25
## 63       F  26      2.3 3.083333 2.500 2.833333     18
## 64       F  25      3.1 4.583333 1.875 2.833333     22
## 65       F  26      2.7 3.416667 2.000 2.416667     17
## 66       M  21      4.1 3.416667 1.875 2.250000     25
## 67       F  23      3.4 3.416667 4.000 2.833333     28
## 68       F  22      2.5 3.583333 2.875 2.250000     22
## 69       F  22      2.1 1.583333 3.875 1.833333     26
## 70       F  22      1.4 3.333333 2.500 2.916667     11
## 71       F  23      1.9 4.333333 2.750 2.916667     29
## 72       M  22      3.7 4.416667 4.500 2.083333     22
## 73       M  23      3.2 4.833333 3.375 2.333333     21
## 74       M  24      2.8 3.083333 2.625 2.416667     28
## 75       F  22      4.1 3.000000 4.125 2.750000     33
## 76       F  23      2.5 4.083333 2.625 3.250000     16
## 77       M  22      2.8 4.083333 2.250 1.750000     31
## 78       M  20      3.8 3.750000 2.750 2.583333     22
## 79       M  22      3.1 3.083333 3.000 3.333333     31
## 80       M  21      3.5 4.750000 1.625 2.833333     23
## 81       F  22      3.6 4.250000 1.875 2.500000     26
## 82       F  23      2.6 4.166667 3.375 2.416667     12
## 83       M  21      4.4 4.416667 3.750 2.416667     26
## 84       M  22      4.5 3.833333 2.125 2.583333     31
## 85       M  29      3.2 3.333333 2.375 3.000000     19
## 86       F  29      3.9 3.166667 2.750 2.000000     30
## 87       F  21      2.5 3.166667 3.125 3.416667     12
## 88       M  28      3.3 3.833333 3.500 2.833333     17
## 89       F  21      3.3 4.250000 2.625 2.250000     18
## 90       F  30      3.0 3.833333 3.375 2.750000     19
## 91       F  21      2.9 3.666667 2.250 3.916667     21
## 92       M  23      3.3 3.833333 3.000 2.333333     24
## 93       F  21      3.3 3.833333 4.000 2.750000     28
## 94       F  21      3.5 3.833333 3.500 2.750000     17
## 95       F  20      3.6 3.666667 2.625 2.916667     18
## 96       M  22      3.7 4.333333 2.500 2.083333     17
## 97       M  21      4.2 3.750000 3.750 3.666667     23
## 98       M  21      3.2 4.166667 3.625 2.833333     26
## 99       F  20      5.0 4.000000 4.125 3.416667     28
## 100      M  22      4.7 4.000000 4.375 1.583333     31
## 101      F  20      3.6 4.583333 2.625 2.916667     27
## 102      F  20      3.6 3.666667 4.000 3.000000     25
## 103      M  24      2.9 3.666667 2.750 2.916667     23
## 104      F  20      3.5 3.833333 2.750 2.666667     21
## 105      F  19      4.0 2.583333 1.375 3.000000     27
## 106      F  21      3.5 3.500000 2.250 2.750000     28
## 107      F  21      3.2 3.083333 3.625 3.083333     23
## 108      F  22      2.6 4.250000 3.750 2.500000     21
## 109      F  25      2.0 3.166667 4.000 2.333333     25
## 110      F  21      2.7 3.083333 3.125 3.000000     11
## 111      F  22      3.2 4.166667 3.250 3.000000     19
## 112      F  25      3.3 2.250000 2.125 4.000000     24
## 113      F  20      3.9 3.333333 2.875 3.250000     28
## 114      M  24      3.3 3.083333 1.500 3.500000     21
## 115      F  20      3.0 2.750000 2.500 3.500000     24
## 116      M  21      3.7 3.250000 3.250 3.833333     24
## 117      F  20      2.5 4.000000 3.625 2.916667     20
## 118      F  20      2.9 3.583333 3.875 2.166667     19
## 119      M  31      3.9 4.083333 3.875 1.666667     30
## 120      F  20      3.6 4.250000 2.375 2.083333     22
## 121      F  22      2.9 3.416667 3.000 2.833333     16
## 122      F  22      2.1 3.083333 3.375 3.416667     16
## 123      M  21      3.1 3.500000 2.750 3.333333     19
## 124      M  22      4.0 3.666667 4.500 2.583333     30
## 125      F  21      3.1 4.250000 2.625 2.833333     23
## 126      F  21      2.3 4.250000 2.750 3.333333     19
## 127      F  21      2.8 3.833333 3.250 3.000000     18
## 128      F  21      3.7 4.416667 4.125 2.583333     28
## 129      F  20      2.6 3.500000 3.375 2.416667     21
## 130      F  21      2.4 3.583333 2.750 3.583333     19
## 131      F  25      3.0 3.666667 4.125 2.083333     27
## 132      M  21      2.8 2.083333 3.250 4.333333     24
## 133      F  24      2.9 4.250000 2.875 2.666667     21
## 134      F  20      2.4 3.583333 2.875 3.000000     20
## 135      M  21      3.1 4.000000 2.375 2.666667     28
## 136      F  20      1.9 3.333333 3.875 2.166667     12
## 137      F  20      2.0 3.500000 2.125 2.666667     21
## 138      F  18      3.8 3.166667 4.000 2.250000     28
## 139      F  21      3.4 3.583333 3.250 2.666667     31
## 140      F  19      3.7 3.416667 2.625 3.333333     18
## 141      F  21      2.9 4.250000 2.750 3.500000     25
## 142      F  20      2.3 3.250000 4.000 2.750000     19
## 143      M  21      4.1 4.416667 3.000 2.000000     21
## 144      F  20      2.7 3.250000 3.375 2.833333     16
## 145      F  21      3.5 3.916667 3.875 3.500000      7
## 146      F  20      3.4 3.583333 3.250 2.500000     21
## 147      F  18      3.2 4.500000 3.375 3.166667     17
## 148      M  22      3.3 3.583333 4.125 3.083333     22
## 149      F  22      3.3 3.666667 3.500 2.916667     18
## 150      M  24      3.5 2.583333 2.000 3.166667     25
## 151      F  19      3.2 4.166667 3.625 2.500000     24
## 152      F  20      3.1 3.250000 3.375 3.833333     23
## 153      F  20      2.8 4.333333 2.125 2.250000     23
## 154      F  17      1.7 3.916667 4.625 3.416667     26
## 155      M  19      1.9 2.666667 2.500 3.750000     12
## 156      F  20      3.5 3.083333 2.875 3.000000     32
## 157      F  20      2.4 3.750000 2.750 2.583333     22
## 158      F  20      2.1 4.166667 4.000 3.333333     20
## 159      F  20      2.9 4.166667 2.375 2.833333     21
## 160      F  19      1.9 3.250000 3.875 3.000000     23
## 161      F  19      2.0 4.083333 3.375 2.833333     20
## 162      F  22      4.2 2.916667 1.750 3.166667     28
## 163      M  35      4.1 3.833333 3.000 2.750000     31
## 164      F  18      3.7 3.166667 2.625 3.416667     18
## 165      F  19      3.6 3.416667 2.625 3.000000     30
## 166      M  21      1.8 4.083333 3.375 2.666667     19
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(learning2014)
## [1] 166   7
library(ggplot2)
#initialize plot with data and aesthetic mapping
p1 <- ggplot(learning2014, aes(x = attitude, y = points))
#define the visualization type (points)
p2 <- p1 + geom_point()
#add a regression line
p3 <- p2 + geom_smooth(method = "lm")
#add a main title and draw the plot
p4 <- p3 + ggtitle ("Student's attitude versus exam points")
p4

#draw a scatter plot matrix of the variables in learning2014.
#[-1] excludes the first column (gender)
pairs(learning2014[-1])

#access the GGally and ggplot2 libraries
install.packages("GGally")
## 
## The downloaded binary packages are in
<<<<<<< HEAD
##  /var/folders/8t/x1h4shkx7ms9f1___g0qnkhr0000gn/T//Rtmpa7Z52d/downloaded_packages
install.packages("ggplot2")
## 
## The downloaded binary packages are in
##  /var/folders/8t/x1h4shkx7ms9f1___g0qnkhr0000gn/T//Rtmpa7Z52d/downloaded_packages
======= ## /var/folders/8t/x1h4shkx7ms9f1___g0qnkhr0000gn/T//RtmpewQAQu/downloaded_packages
install.packages("ggplot2")
## 
## The downloaded binary packages are in
##  /var/folders/8t/x1h4shkx7ms9f1___g0qnkhr0000gn/T//RtmpewQAQu/downloaded_packages
>>>>>>> 77475a1150362e7e1077341fd0cd3c34d7f8d6cc
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)
# create a more advanced plot matrix with ggpairs()
p <- ggpairs(learning2014, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
p

# fit a linear model
my_model <- lm(points ~ 1, data = learning2014)
library(ggplot2)
qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")

qplot(attitude, age, data = learning2014) + geom_smooth(method = "lm")

qplot(attitude, stra, data = learning2014) + geom_smooth(method = "lm")

# create an plot matrix with ggpairs()
ggpairs(learning2014, lower = list(combo = wrap("facethist", bins = 20)))

#create a regression model with 3 multiple explanatory variables
my_model1 <- lm(points ~ attitude + age + stra, data = learning2014)
summary(my_model1)
## 
## Call:
## lm(formula = points ~ attitude + age + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1149  -3.2003   0.3303   3.4129  10.7599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.89543    2.64834   4.114 6.17e-05 ***
## attitude     3.48077    0.56220   6.191 4.72e-09 ***
## age         -0.08822    0.05302  -1.664   0.0981 .  
## stra         1.00371    0.53434   1.878   0.0621 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.2037 
## F-statistic: 15.07 on 3 and 162 DF,  p-value: 1.07e-08
par(mfrow = c(2,2))
plot(my_model1)


Exercise 3

alc <- read.csv( file="data/alc.csv", sep=",")
dim(alc)
## [1] 382  35
str(alc)
## 'data.frame':    382 obs. of  35 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : int  2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : int  8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : int  8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...

This is a combined data from two original datasets on student achievement in secondary education of two Portuguese schools. It was merged to include only the students who answered the questionnaire in both math and portuguese classes.New column high use was created when’alc_use’ is greater than 2. Data has 382 observations of 35 variables. I choose the 4 following variables to test their relationships with high alcohol consumption. 1. Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart. I suppose that living apart with the students leads to high alcohol consumption.

  1. famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent). I suppose that bad family relationships result on high alcohol consumption.

  2. goout - going out with friends (numeric: from 1 - very low to 5 - very high). I suppose that student who often go out with friends consume more alcohol.

  3. romantic - with a romantic relationship (binary: yes or no). I suppose that students who are not in relationships consume more alcohol.

library(tidyr); library(dplyr); library(ggplot2)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school     <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, G…
## $ sex        <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F,…
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 1…
## $ address    <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,…
## $ famsize    <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, GT3…
## $ Pstatus    <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, T,…
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3,…
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3,…
## $ Mjob       <fct> at_home, at_home, at_home, health, other, services, o…
## $ Fjob       <fct> teacher, other, other, services, other, other, other,…
## $ reason     <fct> course, course, other, home, home, reputation, home, …
## $ nursery    <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
## $ internet   <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes, y…
## $ guardian   <fct> mother, father, mother, mother, father, mother, mothe…
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3,…
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2,…
## $ failures   <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ schoolsup  <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, no…
## $ famsup     <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes, y…
## $ paid       <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, no…
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes, y…
## $ higher     <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes…
## $ romantic   <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no, …
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5,…
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3,…
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2,…
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1,…
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4,…
## $ absences   <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3,…
## $ G1         <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 1…
## $ G2         <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, …
## $ G3         <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12,…
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5…
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE…
#use gather() to gather columns into key-value pairs and then glimpse() at the resulting data
gather(alc) %>% glimpse
## Warning: attributes are not identical across measure variables;
## they will be dropped
## Observations: 13,370
## Variables: 2
## $ key   <chr> "school", "school", "school", "school", "school", "school"…
## $ value <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP"…
#draw a bar plot of each variable
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free")+geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped

Bar plots illustrate the following distribution of the variables: Pstatus: the majotiry of students have parents living together romantic: twice more students are no in romantic relations famrel: the majority of the respondents have very good relations with parents,less have excellent and good relations while thre are students who have bad and very bad relations goout: has a normal distribution.

library(dplyr); library(ggplot2)
# produce summary statistics by group
alc %>% group_by(sex, high_use) %>% summarise(count = mean(G3))
## # A tibble: 4 x 3
## # Groups:   sex [2]
##   sex   high_use count
##   <fct> <lgl>    <dbl>
## 1 F     FALSE     11.4
## 2 F     TRUE      11.7
## 3 M     FALSE     12.2
## 4 M     TRUE      10.3
library(ggplot2)
#initialize a plot of high_use and G3
g1 <- ggplot(alc, aes(x = high_use, y = G3))
#define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("grade")

#initialise a plot of high_use and absences
g2 <- ggplot(alc, aes(x = high_use, y = absences))
#define the plot as a boxplot and draw it
g2 + geom_boxplot() + ylab("Student absences by alcohol consumption and sex")

#my boxplots
 g3 <- ggplot(alc, aes(x = high_use, y = goout))
 g3 + geom_boxplot() + ylab("Go out by alcohol consumption")

 g4 <- ggplot(alc, aes(x = high_use, y = famrel))
 g4 + geom_boxplot() + ylab("Family relations by alcohol consumption")

First here is the boxplot from the exercise which shows the relations between high use of alcohol and final grades. Here we seen a lot of outliners with the low grades and several with high grades who high consume alcohol. Next there are boxplots on absences. Both boxpots are located quite low which shows the low level of abcences and true assumtion on higher alcohol use. However there are a lot of outliners who do not miss school and consume more as well as go to school and do not consume.

  1. I have constructed the boxplot on goout and high use of alcohol. The true has a higher probability with the minimum 3 and max 5 which is often.The firts boxplot shows that 75% observanions are located low from 1 to 3 (students who go out rare and for them high use of alcohol is false). However there is one outliner which goes out often (5) and does not drink. -> I have support for my hypothesis about go out and high alcohol.

  2. I have constructed the boxplot on family relations and high use of alcohol. In the fist plot we see that the asumption is false for people who have good (3) and excellent relations with parents (5) which coincides with my assumption + we have outliners who have bad relations with parents. The second boxplot illustrates that the assumtion is true for people who have bad and good relations with one outliner who has bad relations but does not consume much -> I have support for my hypothesis about family relations and high alcohol.

  3. I also have tried to construct boxpolts for variables Pstatus and romantic but they look very stange. -> I suppose there are not significant relations between Pstatus and romantic relations with high alcohol consumption but let’s test it anyway in regression.

#find the model with glm()
m <- glm(high_use ~ famrel + goout+romantic+Pstatus, data = alc, family = "binomial")

#print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ famrel + goout + romantic + Pstatus, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5849  -0.7766  -0.5662   0.9554   2.4378  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.8376     0.7049  -2.607  0.00914 ** 
## famrel       -0.3732     0.1340  -2.784  0.00537 ** 
## goout         0.7959     0.1185   6.718 1.84e-11 ***
## romanticyes  -0.2826     0.2658  -1.063  0.28763    
## PstatusT     -0.1019     0.4029  -0.253  0.80027    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 406.92  on 377  degrees of freedom
## AIC: 416.92
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept)      famrel       goout romanticyes    PstatusT 
##  -1.8375812  -0.3731507   0.7959347  -0.2826057  -0.1019220
#compute odds ratios (OR)
OR <- coef(m) %>% exp
#compute confidence intervals (CI)
CI <- confint(m)
## Waiting for profiling to be done...
#print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR      2.5 %     97.5 %
## (Intercept) 0.1592020 -3.2514939 -0.4772475
## famrel      0.6885615 -0.6385795 -0.1114287
## goout       2.2165118  0.5699984  1.0354467
## romanticyes 0.7538169 -0.8126779  0.2317825
## PstatusT    0.9031000 -0.8753200  0.7157342

In my model we see that only famrel and goout variables are significant. Famrel coeffcient is -0.3731507 which according to the scale of the variable means that better family relations lead to less alcohol use. Goout coefficient is 0.7959347 which means that often go out with friens leads to higher alcohol consumotion. So my hypothesis regarding these variables are testified while regarding status of parents and romantic relations not. <<<<<<< HEAD Odds show that go out twice increases the probability of alcohol consumption. Good family relations decrease the alcohol consumption by 68%, romantic relations also decrease the alcohol by 75% as well as higher status pf parents by 90%. ======= Odds show a higher than 1 probability of success (high alcohol consumption) for goout. Odds for presence of famreal is also higher (97.5% interval) than absence. >>>>>>> 1ec0b112ee3f4f518ef03287b10fa255ad0529af

#Binary predictions
# predict() the probability of high_use
probabilities <- predict(m, type = "response")

#add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

#use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = "high_use")

#see the last ten original classes, predicted probabilities, and class predictions
select(alc, famrel, goout, high_use, probability, prediction) %>% tail(10)
##     famrel goout high_use probability prediction
## 373      4     2    FALSE  0.13702332   high_use
## 374      5     3     TRUE  0.19506126   high_use
## 375      5     3    FALSE  0.19506126   high_use
## 376      4     3    FALSE  0.20967135   high_use
## 377      5     2    FALSE  0.09855474   high_use
## 378      4     4    FALSE  0.43822544   high_use
## 379      2     2    FALSE  0.25087798   high_use
## 380      1     1    FALSE  0.17994514   high_use
## 381      2     5     TRUE  0.78480132   high_use
## 382      4     1     TRUE  0.06684647   high_use
#tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$high_use)
##         prediction
## high_use FALSE TRUE
##    FALSE   268    0
##    TRUE      0  114
# access dplyr and ggplot2
library(dplyr); library(ggplot2)
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g6 <- ggplot(alc, aes(x = "high_use", y = "probability"))
# define the geom as points and draw the plot
g6+geom_point()

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)%>%prop.table%>%addmargins
##         prediction
## high_use  high_use       Sum
##    FALSE 0.7015707 0.7015707
##    TRUE  0.2984293 0.2984293
##    Sum   1.0000000 1.0000000
#accuracy and loss function
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = 0)
## [1] 0.2984293
loss_func(class = alc$high_use, prob = 1)
## [1] 0.7015707
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2513089

I construct a ‘confusion mattrix’ which describes the performance of my model with such predictors as famrel and goout. We see that the majority of cases (268) are cases in which we predicted False and are actually False. There are no cases (0) where we predicted True, but actual is False (so no Type I error) and no cases where we predicted False, True in actual (no Type II error). And there are 114 cases where we predicted True, and True in actual. So we can conlude that my model well predicts the target variable. Next I computed the total proportion of inaccurately classified individuals. We see that the average number of wrong predictions is 0.2984, while the correct predictions is 0.7015707. So the model predicts quite good.

#Cross-validation
#define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

#compute the average number of wrong predictions in the (training) data

#K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = nrow(alc))
#average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2539267
# 10-fold cross-validation 
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)

alc.glm <- glm(high_use ~ famrel+goout, binomial, data = alc)
(cv.err <- cv.glm(alc, alc.glm, cost, K = nrow(alc))$delta)
## [1] 0.2460733 0.2460733
(cv.11.err <- cv.glm(alc, alc.glm, cost, K = 10)$delta)
<<<<<<< HEAD
## [1] 0.2696335 0.2619720
=======
## [1] 0.2643979 0.2548998
>>>>>>> 77475a1150362e7e1077341fd0cd3c34d7f8d6cc

Here a made a cross-validation from the DataCamp + made myself a 10-fold cross-validation. My model has a little bit higher prediction error using 10-fold cross-validation) compared to the model introduced in DataCamp (which had about 0.26 error). So it is not better.

#cross-validation with 6 predictors
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)

alc.glm <- glm(high_use ~ famrel+goout+sex+absences+freetime+paid, binomial, data = alc)
(cv.err <- cv.glm(alc, alc.glm, cost, K = nrow(alc))$delta)
## [1] 0.2225131 0.2205532
(cv.11.err <- cv.glm(alc, alc.glm, cost, K = 10)$delta)
<<<<<<< HEAD
## [1] 0.2303665 0.2265289
=======
## [1] 0.2198953 0.2154409
>>>>>>> 77475a1150362e7e1077341fd0cd3c34d7f8d6cc
#cross-validation with 4 predictors: sex, absences, freetime, paid
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)

alc.glm <- glm(high_use ~ sex+absences+freetime+paid, binomial, data = alc)
(cv.err <- cv.glm(alc, alc.glm, cost, K = nrow(alc))$delta)
## [1] 0.2591623 0.2594707
(cv.11.err <- cv.glm(alc, alc.glm, cost, K = 10)$delta)
<<<<<<< HEAD
## [1] 0.2617801 0.2617801
=======
## [1] 0.2670157 0.2659809
>>>>>>> 77475a1150362e7e1077341fd0cd3c34d7f8d6cc
#cross-validation with 4 predictors: famrel, goout, absences, freetime
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)

alc.glm <- glm(high_use ~ famrel+goout+absences+freetime, binomial, data = alc)
(cv.err <- cv.glm(alc, alc.glm, cost, K = nrow(alc))$delta)
## [1] 0.2486911 0.2479510
(cv.11.err <- cv.glm(alc, alc.glm, cost, K = 10)$delta)
<<<<<<< HEAD
## [1] 0.2565445 0.2528919
=======
## [1] 0.2460733 0.2415641
>>>>>>> 77475a1150362e7e1077341fd0cd3c34d7f8d6cc
#cross-validation with 2 predictors: famrel, absences
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)

alc.glm <- glm(high_use ~ famrel+absences, binomial, data = alc)
(cv.err <- cv.glm(alc, alc.glm, cost, K = nrow(alc))$delta)
## [1] 0.2931937 0.2938516
(cv.11.err <- cv.glm(alc, alc.glm, cost, K = 10)$delta)
<<<<<<< HEAD
## [1] 0.2879581 0.2931320
=======
## [1] 0.2931937 0.2966476
>>>>>>> 77475a1150362e7e1077341fd0cd3c34d7f8d6cc

Here I Perform cross-validation to compare the performance of different logistic regression models. I first test model with such predictors as famrel, goout, sex, absences, freetime, paid. It is guite a good model since the errors are lower than before and the training error is less than the test error. Now let’s try to decrease the number of predictors. I have tried the model with 4 predictors: sex, absences, freetime, paid. Here again errors are higher and test erroes are higher. I also tried model with such 4 predictors: famrel, goout, absences, freetime. Here errors are higher than in model with 6 predictors but lower than in the previous model. Also testing errors in 10 validation are less than in training, so the model is quite good. In the end I am trying the model with 2 predictors:family relations and absences. Errors are the highest but the testing ones are less than training.

Conclusion:The higher number of the predictors result in the lower errors of the model since the more factors we have the much our model explains. ***

Exercise 4

## Data overview
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
#variables distribution
p <- pairs(Boston)

p
## NULL
#calculate the correlation matrix and round it
cor_matrix<-cor(Boston) 
#print the correlation matrix
cor_matrix
##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black   -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
##                  rm         age         dis          rad         tax
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304
## black    0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593
##            ptratio       black      lstat       medv
## crim     0.2899456 -0.38506394  0.4556215 -0.3883046
## zn      -0.3916785  0.17552032 -0.4129946  0.3604453
## indus    0.3832476 -0.35697654  0.6037997 -0.4837252
## chas    -0.1215152  0.04878848 -0.0539293  0.1752602
## nox      0.1889327 -0.38005064  0.5908789 -0.4273208
## rm      -0.3555015  0.12806864 -0.6138083  0.6953599
## age      0.2615150 -0.27353398  0.6023385 -0.3769546
## dis     -0.2324705  0.29151167 -0.4969958  0.2499287
## rad      0.4647412 -0.44441282  0.4886763 -0.3816262
## tax      0.4608530 -0.44180801  0.5439934 -0.4685359
## ptratio  1.0000000 -0.17738330  0.3740443 -0.5077867
## black   -0.1773833  1.00000000 -0.3660869  0.3334608
## lstat    0.3740443 -0.36608690  1.0000000 -0.7376627
## medv    -0.5077867  0.33346082 -0.7376627  1.0000000
#visualize the correlation matrix
library(corrplot)
## corrplot 0.84 loaded
library(tidyverse)
## ── Attaching packages ───────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble  2.1.3     ✔ purrr   0.3.3
## ✔ readr   1.3.1     ✔ stringr 1.4.0
## ✔ tibble  2.1.3     ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ MASS::select()  masks dplyr::select()
corrplot(cor_matrix, method="circle")

corrplot(cor_matrix, method="circle", type = "upper",cl.pos = "b", tl.pos = "d",tl.cex = 0.6)%>%round()

##         crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
## crim       1  0     0    0   0  0   0   0   1   1       0     0     0    0
## zn         0  1    -1    0  -1  0  -1   1   0   0       0     0     0    0
## indus      0 -1     1    0   1  0   1  -1   1   1       0     0     1    0
## chas       0  0     0    1   0  0   0   0   0   0       0     0     0    0
## nox        0 -1     1    0   1  0   1  -1   1   1       0     0     1    0
## rm         0  0     0    0   0  1   0   0   0   0       0     0    -1    1
## age        0 -1     1    0   1  0   1  -1   0   1       0     0     1    0
## dis        0  1    -1    0  -1  0  -1   1   0  -1       0     0     0    0
## rad        1  0     1    0   1  0   0   0   1   1       0     0     0    0
## tax        1  0     1    0   1  0   1  -1   1   1       0     0     1    0
## ptratio    0  0     0    0   0  0   0   0   0   0       1     0     0   -1
## black      0  0     0    0   0  0   0   0   0   0       0     1     0    0
## lstat      0  0     1    0   1 -1   1   0   0   1       0     0     1   -1
## medv       0  0     0    0   0  1   0   0   0   0      -1     0    -1    1
<<<<<<< HEAD

This data has 506 obs. of 14 variables which contain information about different indicators of Boston city as per capita crime rate, index of accessibility to radial highways and etc. In the first plot we see the distibution of variables. It looks like only rm has a normal distribution while crime, chas, lstat are shifted to the left, age to the right and etc. In the second plot positive correlations are displayed in blue and negative correlations in red color. Color intensity and the size of the circle are proportional to the correlation coefficients. For example, the is strong positive correlation between such variables as crime and rad and tax, between indus and nox, age and tax. Negative correlations between dis and nox,age and dis, medv and lstat.

=======

This data has 506 obs. of 14 variables which contain information about different indicators of Boston city as per capita crime rate, index of accessibility to radial highways and etc. In the fist plot we seen the distibution of variables. It looks like only rm has a normal distribution while crime, chas, lstat are shifted to the left, age to the right and etc. In the second plot positive correlations are displayed in blue and negative correlations in red color. Color intensity and the size of the circle are proportional to the correlation coefficients. For example, the is strong positive correlation between such variables as crime and rad and tax, between indus and nox, age and tax. Negative correlations between dis and nox,age and dis, medv and lstat.

>>>>>>> 77475a1150362e7e1077341fd0cd3c34d7f8d6cc
## Scaling and factor variable

#center and standardize variables
boston_scaled <- scale (Boston)
#summaries of the scaled variables
summary(boston_scaled) 
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
#class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
#change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
#create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
#create a categorical variable 'crime'
vector <- c("low","med_low","med_high","high")
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=vector)
#look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
#add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

## Train and test sets
#number of rows in the Boston dataset 
n <- nrow(boston_scaled)
#choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)
#create train set
train <- boston_scaled[ind,]
#create test set 
test <- boston_scaled[-ind,]
#save the correct classes from test data
correct_classes <- c(test$crime)
#remove the crime variable from test data
test <- dplyr::select(test, -crime)

str(train)
## 'data.frame':    404 obs. of  14 variables:
##  $ zn     : num  -0.487 -0.487 -0.487 -0.487 -0.487 ...
<<<<<<< HEAD
##  $ indus  : num  -1.033 1.015 -0.164 2.42 -0.72 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 3.665 ...
##  $ nox    : num  -0.3857 0.3651 -0.0664 0.4686 -0.4116 ...
##  $ rm     : num  0.187 0.53 -0.508 -0.429 0.628 ...
##  $ age    : num  0.5515 1.1164 0.6972 0.5302 -0.0737 ...
##  $ dis    : num  -0.546 -1.077 -0.633 -0.8 -0.068 ...
##  $ rad    : num  -0.522 1.66 -0.408 -0.637 -0.178 ...
##  $ tax    : num  -0.666 1.529 0.141 1.796 -0.601 ...
##  $ ptratio: num  -0.857 0.806 -0.303 0.76 -0.488 ...
##  $ black  : num  0.4253 -3.5229 -0.1289 0.4406 0.0386 ...
##  $ lstat  : num  -0.506 1.1997 0.4351 0.0976 -0.6446 ...
##  $ medv   : num  0.116 -0.58 -0.46 -0.265 0.703 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 2 4 2 2 3 4 2 1 2 4 ...
str(test)
## 'data.frame':    102 obs. of  13 variables:
##  $ zn     : num  0.2845 0.0487 0.0487 -0.4872 -0.4872 ...
##  $ indus  : num  -1.287 -0.476 -0.476 -0.437 -0.437 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.144 -0.265 -0.265 -0.144 -0.144 ...
##  $ rm     : num  0.413 -0.93 0.131 -1.179 -0.455 ...
##  $ age    : num  -0.12 1.116 0.914 -1.136 0.733 ...
##  $ dis    : num  0.140075 1.086122 1.21178 0.000692 0.103175 ...
##  $ rad    : num  -0.982 -0.522 -0.522 -0.637 -0.637 ...
##  $ tax    : num  -0.666 -0.577 -0.577 -0.601 -0.601 ...
##  $ ptratio: num  -1.46 -1.5 -1.5 1.18 1.18 ...
##  $ black  : num  0.441 0.328 0.393 -0.741 0.393 ...
##  $ lstat  : num  -1.074 2.419 1.092 -0.135 0.165 ...
##  $ medv   : num  0.16 -0.656 -0.819 -0.254 -0.319 ...
======= ## $ indus : num -0.211 -0.548 -0.18 -0.164 1.567 ... ## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ... ## $ nox : num 0.2615 -0.5324 -0.0923 -0.0664 0.5981 ... ## $ rm : num -1.019 0.186 -1.867 -0.508 -0.609 ... ## $ age : num 0.175 -1.012 -1.093 0.697 1.052 ... ## $ dis : num -0.663 0.44 -0.606 -0.633 -1.01 ... ## $ rad : num -0.408 -0.522 -0.637 -0.408 -0.637 ... ## $ tax : num -0.102 -0.719 -0.618 0.141 0.171 ... ## $ ptratio: num 0.3439 0.5286 -0.0257 -0.3028 1.2677 ... ## $ black : num 0.4282 0.4406 -0.0682 -0.1289 0.3874 ... ## $ lstat : num 0.34266 -0.91486 -0.00183 0.43508 1.21368 ... ## $ medv : num -0.547 0.268 -0.699 -0.46 -1.004 ... ## $ crime : Factor w/ 4 levels "low","med_low",..: 2 3 3 2 2 2 1 4 2 4 ...
str(test)
## 'data.frame':    102 obs. of  13 variables:
##  $ zn     : num  -0.4872 -0.4872 0.0487 0.0487 0.0487 ...
##  $ indus  : num  -0.593 -0.593 -0.476 -0.476 -0.476 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.74 -0.74 -0.265 -0.265 -0.265 ...
##  $ rm     : num  0.194 1.281 -0.16 -0.93 -0.563 ...
##  $ age    : num  0.367 -0.266 0.978 1.116 -1.051 ...
##  $ dis    : num  0.557 0.557 1.024 1.086 0.786 ...
##  $ rad    : num  -0.867 -0.867 -0.522 -0.522 -0.522 ...
##  $ tax    : num  -0.986 -0.986 -0.577 -0.577 -0.577 ...
##  $ ptratio: num  -0.303 -0.303 -1.504 -1.504 -1.504 ...
##  $ black  : num  0.441 0.396 0.441 0.328 0.371 ...
##  $ lstat  : num  -0.492 -1.208 0.91 2.419 0.428 ...
##  $ medv   : num  -0.1014 1.3229 0.4966 -0.6559 -0.0906 ...
>>>>>>> 77475a1150362e7e1077341fd0cd3c34d7f8d6cc

Here we scale the data which is an operation when we subtract the column means from the corresponding columns and divide the difference with standard deviation. It helps us to have normal distribution of variables later used in clastering. When we create a factor variable’crim’ and use the quantiles as the break points to the variable.Later we divide the dataset to train and test sets, so that 80% of the data belongs to the train set.

# Linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
#print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
<<<<<<< HEAD
## 0.2500000 0.2351485 0.2599010 0.2549505 
## 
## Group means:
##                   zn      indus         chas        nox         rm
## low       0.91794122 -0.8964476 -0.155385496 -0.8708570  0.4612307
## med_low  -0.07742503 -0.3407435 -0.106556426 -0.5896536 -0.1173697
## med_high -0.38025152  0.1117786  0.177625245  0.3633568  0.1183384
## high     -0.48724019  1.0170891 -0.004759149  1.0366145 -0.3987911
##                 age        dis        rad        tax     ptratio
## low      -0.8740888  0.8265347 -0.6953230 -0.7578877 -0.41118220
## med_low  -0.3824463  0.4012400 -0.5357824 -0.4911322 -0.07281424
## med_high  0.3565856 -0.3583810 -0.3890435 -0.3109617 -0.31379225
## high      0.8228049 -0.8454307  1.6384176  1.5142626  0.78111358
##               black       lstat        medv
## low       0.3853158 -0.78525002  0.55762985
## med_low   0.3141281 -0.20235972  0.03385894
## med_high  0.1143596  0.01385463  0.16646579
## high     -0.7720726  0.92007817 -0.69416009
## 
## Coefficients of linear discriminants:
##                   LD1         LD2         LD3
## zn       0.0846507053  0.76548460 -0.99101074
## indus   -0.0008345807 -0.19960977  0.31553665
## chas    -0.0641603165 -0.10027899  0.07210062
## nox      0.3675570896 -0.66628970 -1.34551901
## rm      -0.0958845586 -0.11424536 -0.21780873
## age      0.3117103860 -0.30660388 -0.03821072
## dis     -0.0517525721 -0.24641974  0.34604917
## rad      2.9517261386  0.94314128 -0.03558019
## tax      0.0541839210 -0.08115445  0.52756593
## ptratio  0.1025626309  0.11170313 -0.32149396
## black   -0.1191068792 -0.03112145  0.07744718
## lstat    0.2400950484 -0.26590388  0.32710883
## medv     0.1845112388 -0.31357167 -0.13482788
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9489 0.0385 0.0126
======= ## 0.2648515 0.2450495 0.2574257 0.2326733 ## ## Group means: ## zn indus chas nox rm ## low 1.0439144 -0.9288987 -0.08835242 -0.8764053 0.4794906 ## med_low -0.1058935 -0.3045329 -0.03371693 -0.5569188 -0.1369742 ## med_high -0.3783982 0.1257995 0.18195173 0.3484040 0.1138193 ## high -0.4872402 1.0172896 -0.02102480 1.0445410 -0.4221093 ## age dis rad tax ptratio ## low -0.8892358 0.8665669 -0.6920711 -0.7598856 -0.48626159 ## med_low -0.3206313 0.3492954 -0.5410455 -0.4807548 -0.02938384 ## med_high 0.3530373 -0.3366877 -0.4032205 -0.3318525 -0.25660396 ## high 0.8319596 -0.8744946 1.6363892 1.5128120 0.77875205 ## black lstat medv ## low 0.37809894 -0.793787281 0.5687383 ## med_low 0.30952994 -0.115880207 -0.0346484 ## med_high 0.08549545 -0.002623744 0.1689371 ## high -0.72403795 0.908518686 -0.7055688 ## ## Coefficients of linear discriminants: ## LD1 LD2 LD3 ## zn 0.078657359 0.857520221 -0.8717584 ## indus -0.002909672 -0.253917146 0.2337975 ## chas -0.095400507 -0.032868284 0.0429066 ## nox 0.383714004 -0.683088253 -1.4310371 ## rm -0.093371248 -0.078440697 -0.1405693 ## age 0.254635879 -0.274787839 -0.1586730 ## dis -0.036158705 -0.432442857 0.1887757 ## rad 3.108280558 0.927058305 -0.1814222 ## tax 0.059522984 -0.071865646 0.7464100 ## ptratio 0.120166093 0.051856618 -0.2547177 ## black -0.099241975 -0.007624795 0.1107453 ## lstat 0.277383331 -0.296213516 0.3767497 ## medv 0.216388335 -0.423361580 -0.2266234 ## ## Proportion of trace: ## LD1 LD2 LD3 ## 0.9457 0.0394 0.0148 >>>>>>> 77475a1150362e7e1077341fd0cd3c34d7f8d6cc
#the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

#target classes as numeric
classes <- as.numeric(train$crime)
#plot the lda results
plot(lda.fit, dimen = 2,col = classes,pch = classes)
lda.arrows(lda.fit, myscale = 2)
<<<<<<< HEAD

=======

>>>>>>> 77475a1150362e7e1077341fd0cd3c34d7f8d6cc

We have 4 clasters, so train data was devided in 25% 4 times with crime as target variable. In the plot we see that three clasters are overlapping while the 4rd one in quite far for them. We aslo see that such variables as zn, nox, rad, ptratio have a big impact on the model

# Predictors
#create train set
train <- boston_scaled[ind,]
#create test set 
test <- boston_scaled[-ind,]
#save the correct classes from test data
correct_classes <- c(test$crime)
#remove the crime variable from test data
test <- dplyr::select(test, -crime)
#predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##        predicted
## correct low med_low med_high high
<<<<<<< HEAD
##       1  16      10        0    0
##       2   6      14       11    0
##       3   0       2       19    0
##       4   0       0        0   24
======= ## 1 10 10 0 0 ## 2 5 17 5 0 ## 3 0 4 17 1 ## 4 0 0 0 33 >>>>>>> 77475a1150362e7e1077341fd0cd3c34d7f8d6cc
#target classes as numeric
classes <- as.numeric(correct_classes)
#plot the lda results
plot(lda.fit, dimen = 2,col = classes,pch = classes)
lda.arrows(lda.fit, myscale = 2)
<<<<<<< HEAD

=======

>>>>>>> 77475a1150362e7e1077341fd0cd3c34d7f8d6cc

In the table we see the relation between the correct classes and the predicted ones. The classifier did not predict the crime rates correctly since predicted numbers are higher than correct. I also tried to vizualize the LDA results for crime in test data and we see that everything is in a mess.

## distances
library(MASS)
data('Boston')
#scale the data
boston_scaled <- scale (Boston)
#euclidean distance matrix
dist_eu <- (boston_scaled)
#look at the summary of the distances
summary(dist_eu)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
#manhattan distance matrix
dist_man <- dist(boston_scaled, method = "manhattan")
#look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
# k-means clustering
km <-kmeans(boston_scaled, centers = 1)
km <-kmeans(boston_scaled, centers = 3)
km <-kmeans(boston_scaled, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled, col=km$cluster)

#determine K
set.seed(123)
library(ggplot2)
#determine the number of clusters
k_max <- 10
#calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
#visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

#k-means clustering
km <-kmeans(boston_scaled, centers = 2)

Based on the calculated distance meajure the k-classtering was maded. On the graph we see a significant change in point 2 - so the optimal number of clastters is 2. On the plot we see the vizualization for 2 clasters - which part of data where belongs.

## 3D plot
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
<<<<<<< HEAD ======= >>>>>>> 77475a1150362e7e1077341fd0cd3c34d7f8d6cc